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Abstract 

We consider forecasting a single time series when there is a large number of pre¬ 
dictors and a possible nonlinear effect. The dimensionality was first reduced via a 
high-dimensional (approximate) factor model implemented by the principal compo¬ 
nent analysis. Using the extracted factors, we develop a novel forecasting method 
called the sufficient forecasting, which provides a set of sufficient predictive indices, 
inferred from high-dimensional predictors, to deliver additional predictive power. The 
projected principal component analysis will be employed to enhance the accuracy of 
inferred factors when a semi-parametric (approximate) factor model is assumed. Our 
method is also applicable to cross-sectional sufficient regression using extracted factors. 

The connection between the sufficient forecasting and the deep learning architecture 
is explicitly stated. The sufficient forecasting correctly estimates projection indices of 
the underlying factors even in the presence of a nonparametric forecasting function. 

The proposed method extends the sufficient dimension reduction to high-dimensional 
regimes by condensing the cross-sectional information through factor models. We de¬ 
rive asymptotic properties for the estimate of the central subspace spanned by these 
projection directions as well as the estimates of the sufficient predictive indices. We 
further show that the natural method of running multiple regression of target on es¬ 
timated factors yields a linear estimate that actually falls into this central subspace. 

Our method and theory allow the number of predictors to be larger than the number 
of observations. We finally demonstrate that the sufficient forecasting improves upon 
the linear forecasting in both simulation studies and an empirical study of forecasting 
macroeconomic variables. 

Key Words; Regression; forecasting; deep learning; factor model; semi-parametric factor 
model; principal components; learning indices; sliced inverse regression; dimension reduction. 
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1 Introduction 


Regression and forecasting in a data-rich environment have been an important research 
topic in statistics, economics and hnance. Typical examples include forecasts of a macroe¬ 


conomic output using a large number of employment and production variables (Stock and 


Watson, 

1989 

Bernanke et al., 

2005 

), forecasts of the values of market prices and dividends 

using cross-sectional asset returns ( 

Sharpe, 

1964 

Lintner 

1965 

), and studies of association 


between clinical outcomes and high throughput genomics and genetic information such as 
microarray gene expressions. The predominant framework to harness the vast predictive 
information is via the factor model, which proves effective in simultaneously modeling the 
commonality and cross-sectional dependence of the observed data. It is natural to think that 
the latent factors drive simultaneously the vast covariate information as well as the outcome. 
This achieves the dimensionality reduction in the regression and predictive models. 

Turning the curse of dimensionality into blessing, the latent factors can be accurately 
extracted from the vast predictive variables and hence they can be reliably used to build 
models for response variables. For the same reason, factor models have been widely employed 


in many applications, such as portfolio management (Fama and French, 1992; Carhart, 1997 


Fama and French, 2015 Hon et ah, 2015), large-scale multiple testing (Leek and Storey 


2008 Fan et al., 2012), high-dimensional covariance matrix estimation (Fan et ah, 2008 


2013), and forecasting using many predictors (Stock and Watson, 2002 ap). Leveraging on 


the recent developments on the factor models, in this paper, we are particularly interested 
in the problem of forecasting. Our techniques can also be applied to regression problems, 
resulting in sufficient regression. 

With little knowledge of the relationship between the forecast target and the latent 
factors, most research focuses on a linear model and its rehnement. Motivated by the classic 
principal component regression (Kendall, 1957 Hotelling, 1957), Stock and Watson ( 2002a|[6 ) 
employed a similar idea to forecast a single time series from a large number of predictors: 
hrst use the principal component analysis (PCA) to estimate the underlying common factors, 
followed by a linear regression of the target on the estimated factors. The key insight here is 
to condense information from many cross-sectional predictors into several predictive indices. 


Boivin and Ng (2006) pointed out that the relevance of factors is important to the forecasting 


power, and may lead to improved forecast. As an improvement to this procedure, Bair et al. 


(2006) used correlation screening to remove irrelevant predictors before performing the PCA. 


In a similar fashion, Bai and Ng (2008) applied boosting and employed thresholding rules 


to select “targeted predictors”, and Stock and Watson (2012) used shrinkage methods to 
downweight unrelated principal components. Kelly and Pruitt (2015) took into account the 
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covariance with the forecast target in linear forecasting, and they proposed a three-pass 
regression hlter that generalizes partial least squares to forecast a single time series. 

As mentioned, most of these approaches are fundamentally limited to linear forecasting. 
This yields only one index of extracted factors for forecasting. It does not provide sufficient 
predictive power when the predicting function is nonlinear and depends on multiple indices 
of extracted factors. Yet, nonlinear models usually outperform linear models in time series 
analysis (Tjpstheim, 1994), especially in multi-step prediction. When the link function be¬ 
tween the target and the factors is arbitrary and unknown, a thorough exploration of the 
factor space often leads to additional gains. To the best of our knowledge, there are only 
a few papers in forecasting a nonlinear time series using factor models. Bai and Ng (2008) 
discussed the use of squared factors (i.e., volatility of the factors) in augmenting forecasting 
equation. Ludvigson and Ng (2007) found that the square of the first factor estimated from 
a set of hnancial factors is signihcant in the regression model for mean excess returns. This 
naturally leads to the question of which effective factor, or more precisely, which effective 


direction of factor space to include for higher moments. Cunha et ah (2010) used a class 
of nonlinear factor models with shape constraints to study cognitive and noncognitive skill 
formation. The nonparametric forecasting function would pose a signihcant challenge in 
estimating effective predictive indices. 

In this work, we shall address this issue from a completely different perspective to exist¬ 
ing forecasting methods. We introduce a favorable alternative method called the sufficient 
forecasting. Our proposed forecasting method springs from the idea of the sliced inverse 


regression, which was hrst introduced in the seminal work by Li (1991). In the presence 
of some arbitrary and unknown (possibly nonlinear) forecasting function, we are interested 
in estimating a set of sufficient predictive indices., given which the forecast target is inde¬ 
pendent of unobserved common factors. To put it another way, the forecast target relates 
to the unobserved common factors only through these sufficient predictive indices. Such a 
goal is closely related to the estimation of the central subspace in the dimension reduction 
literature ( Cook| 2009). In a linear forecasting model, such a central subspace consists of 
only one dimension. In contrast, when a nonlinear forecasting function is present, the central 
subspace can go beyond one dimension, and our proposed method can effectively reveal mul¬ 
tiple sufficient predictive indices to enhance the prediction power. This procedure therefore 
greatly enlarges the scope of forecasting using factor models. As demonstrated in numer¬ 
ical studies, the sufficient forecasting has improved performance over benchmark methods, 
especially under a nonlinear forecasting equation. 

In summary, the contribution of this work is at least twofold. On the one hand, our 
proposed sufficient forecasting advances existing forecasting methods, and hlls the impor- 
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tant gap between incorporating target information and dealing with nonlinear forecasting. 
Our work identifies effective factors that influence the forecast target without knowing the 
nonlinear dependence. On the other hand, we present a promising dimension reduction tech¬ 
nique through factor models. It is well-known that existing dimension reduction methods are 
limited to either a hxed dimension or a diverging dimension that is smaller than the sample 


size (Zhu et ah, 2006). With the aid of factor models, our work alleviates what plagues 
sufficient dimension reduction in high-dimensional regimes by condensing the cross-sectional 
information, whose dimension can be much higher than the sample size. 

The rest of this paper is organized as follows. Section 2 hrst proposes the sufficient fore¬ 
casting using factor models, and then extends the proposed method by using semi-parametric 
factor models. In Section 3, we establish the asymptotic properties for the sufficient forecast¬ 
ing, and we also prove that the simple linear estimate actually falls into the central subspace. 
Section 4 demonstrates the numerical performance of the sufficient forecasting in simulation 
studies and an empirical application. Section 5 includes a few concluding remarks. Technical 
proofs are rendered in the Appendix. 


2 Sufficient forecasting 

This section hrst presents a unihed framework for forecasting using factor models. We 
then propose the sufficient forecasting procedure with unobserved factors, which estimates 
predictive indices without requiring estimating the unknown forecasting function. 


2.1 Factor models and forecasting 

Consider the following factor model with a target variable yt+i which we wish to forecast: 

yt+i = (2.1) 

Xit = b'h -h Uit, l<f<p, l<t<T, (2.2) 


where Xu is the Tth predictor observed at time t, hi is & K x 1 vector of factor loadings, 
fi = {fit, • • • , fKt)' is a iC X 1 vector of common factors driving both the predictors and the 
response, and uu is the error term, or the idiosyncratic component. For ease of notation. 


we write Xj = (x^, • • ■ ,Xpt)',B = (bi, • • ■ ,bp)' and u* = (uit, • • • ,Mpt)'. In (|2^, h{-) is 
an unknown link function, and e^+i is some stochastic error independent of and Uu- The 
vectors of linear combinations 0i, • ■ ■ , 4>l are iF-dimensional orthonormal vectors. Clearly, 
the model is also applicable to the cross-sectional regression such as those mentioned at 
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the introduction of the paper. Note that the sufficient forecasting can be represented in a 


deep learning architecture (Bengio, 2009; Bengio et al., 2013), consisting of four layers of 
linear/nonlinear processes for dimension reduction and forecasting. The connection between 
sufficient forecasting and deep learning is illustrated in Figure 1. An advantage of our deep 
learning model is that it admits scalable and explicit computational algorithm. 



Figure 1: The sufficient forecasting is represented as a four-layer deep learning architecture. The 
factors fit, - " , Iki are obtained from the inputs via principal component analysis. 


Model (2.1) is a semiparametric multi-index model with latent covariates and unknown 


nonparametric function h{-). The target yt+i depends on the factors f* only through L pre¬ 
dictive indices 0/h, ■ ■ ■ , 0/ft, where L is no greater than K. In other words, these predictive 


indices are sufficient in forecasting yt+i- Models (2.1) and (2.2) effectively reduce the di¬ 


mension from the diverging p to a hxed L to estimate the nonparametric function h{-), and 
greatly alleviates the curse of dimensionality. Here, 0/s are also called the sufficient dimen¬ 


sion reduction (SDR) directions, and their linear span forms the central subspace (Cook 


2009) denoted by Sy\f. Note that the individual directions 0i, • • ■ ,<Pl are not identihable 
without imposing any structural condition on h(-). However, the subspace Sy\f spanned by 
01) ■ ■ ■ ) 0L can be identihed. Therefore, throughout this paper, we refer any orthonormal 
basis 01 , • • • , of the central subspace Sy\f as sufficient dimension reduction directions, 
and their corresponding predictive indices ■ ■ ■ , as sufficient predictive indices. 


Suppose that the factor model (2.2) has the following canonical normalization 


cov(fi) = \k and B'B is diagonal. 


(2.3) 
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where is a ii' x identity matrix. This normalization serves as an identihcation condition 
of B and h because Bh = (Bf2)(f2~^h) holds for any nonsingular matrix f2. We assume 
for simplicity that the means of x^s and f^’s in (2.2) have already been removed, namely, 
E{xit) = 0 and = 0. 

As a special case, suppose we have a priori knowledge that the link function h(-) in (2.1) 
is in fact linear. In this case, no matter what L is, there is only a single index of the factors 
fj in the prediction equation, i.e. L = 1. Now, (2.1) reduces to 


Ut+i — 01 ft + Q+i- 


Such linear forecasting problems using many predictors have been studied extensively in the 
econometrics literature, for example. Stock and Watson (2002a|[6), Bai and Ng (2008), and 


Kelly and Pruitt (2015), among others. Existing methods are mainly based on the principal 


component regression (PCR), which is a multiple linear regression of the forecast target on 
the estimated principal components. The linear forecasting framework does not thoroughly 
explore the predictive power embedded in the underlying common factors. It can only create 
a single index of the unobserved factors to forecast the target. The presence of a nonlinear 
forecasting function with multiple forecasting indices would only deteriorate the situation. 


While extensive efforts have been made such as Boivin and Ng (2006), Bai and Ng (2009), 
and Stock and Watson (2012), these approaches are more tailored to handle issues of using 
too many factors, and they are fundamentally limited to the linear forecasting. 


2.2 Sufficient forecasting 

Traditional analysis of factor models typically focuses on the covariance with the forecast 
target cov(xt, and the covariance within the predictors x^, denoted by a p x p matrix 

Sa, = Bcov(ft)B' + 'Eu, (2.4) 


where is the error covariance matrix of uj. Usually cov(xi, yt+i) and are not enough to 
construct an optimal forecast, especially in the presence of a nonlinear forecasting function. 
To fully utilize the target information, our method considers the covariance matrix of the 


inverse regression curve, E{'x.t\yt+i)- By conditioning on the target yt+i in model (2.2), we 
obtain 

cov{E{xt\yt+i)) = Bcov{E{it\yt+i))B', 


where we used the assumption that E(ut|pt+i) 
any structure on the covariance matrix of u^. 


= 0. This salient feature does not impose 


Under model (2.1), Li (1991) showed that 
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E{it\yt+i) is contained in the central subspace S'^if spanned by provided that 

■ ■ ■ , is a linear function of • • ■ , for any b G This important 
result implies that Sy\{ contains the linear span of cov{E{ft\yt+i))- Thus, it is promising to 
estimate sufficient directions by investigating the top L eigenvectors of cov(£'(b||/i+i)). To 
see this, let $ = (0^, ■ ■ ■ ,<pi). Then, we can write 




(2.5) 


for some L x 1 coefficient function a(-), according to the aforementioned result by Li (1991). 
As a result, 

cov{E{ft\yt+i)) = ^E[a{yt+i)a{yt+i)'^]^'. (2.6) 

This matrix has L nonvanishing eigenvalues if E[sL{yt+i)sL{yt+i)'^] is non-degenerate. Their 
corresponding eigenvectors have the same linear span as 0^, • • ■ 

It is difficult to directly estimate the covariance of the inverse regression curve E{ft\yt^i) 
and obtain sufficient predictive indices. Instead of using the conventional nonparametric 


regression method, we follow Li (1991) to explore the conditional information of these un¬ 


derlying factors. To this end, we introduce the sliced covariance estimate 

1 ^ 

'^f\y = ^ h)E{fl\yt+i e h), (2.7) 

h=l 

where the range of yt+i is divided into H slices Ii,... ,Ih such that P{yt+i E Ih) = ^/H. By 


(2.5), we have 


^f\y = ^ 


H 


H 


'^E{a{yt+i)\yt+i e Ih)E{a{yt+i)\yt+i G hf 


h=l 




( 2 . 8 ) 


This matrix has L nonvanishing eigenvalues as long as the matrix in the bracket is non¬ 
degenerate. In this case, the linear span created by the eigenvectors of the L largest eigen¬ 
values of is the same as that spanned by 0i, ■ ■ ■ ,<Pl- Note that H > max{L, 2} is 
required in order for the matrix in the bracket to be non-degenerate. 


To estimate in (2.7) with unobserved factors, a natural solution is to hrst consistently 


estimate factors h from the factor model (2.2) and then use the estimated factors h and the 


observed target yt+i to approximate the sliced estimate Alternatively, we can start 

with the observed predictors x*. Since E{ut\yt^i) = 0, we then have 


E{:>ct\yt+i e 4) = BE{ft\yt+i e 4), 
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and 

E{ft\yt+i e 4) = AfeE(xt|j/t+i e 4), 
where A;, = (B'B)^^B'. Hence, 

1 ^ 

S/ij; = ^bE{xt\yt+i e 4)^(x'Ji/i+i e lh)K, (2.9) 

h=l 


Notice that A;, is also unknown and needs to be estimated. With a consistent estimator A;,, 
we can nse xj and yt+i to approximate the sliced estimate S/|y. 

in 


In the sequel, we present two estimators, 


f\y 


(2.12) and in (2.13), estimated 


by using respectively (2.7) and (2.9). Interestingly, we show that these two estimators are 


exactly equivalent when factors and factor loadings are estimated from the constrained least 
squares. We denote two equivalent estimators and by As in Remark 2.1 

these two are in general not equivalent. 

In Section 3, we shall show that under mild conditions, is consistently estimated 
by S/Ij, as p, T — )■ cx). Furthermore, the eigenvectors of S/|y corresponding to the L largest 
eigenvalues, denoted as (j = I,-- - ,L), will converge to the corresponding eigenvec¬ 
tors of Sjiy, which actually span the aforementioned central subspace Sy\f. This will yield 
consistent estimates of sufficient predictive indices Given these estimated 

low-dimensional indices, we may employ one of the well-developed nonparametric regression 
techniques to estimate h{ ) and make the forecast, including local polynomial modeling, re¬ 


gression splines, additive modeling, and many others (Green and Silverman, 1993 Fan and 


Gijbels, 1996| Li and Racine, 2007). We summarize the sufficient forecasting procedure in 


Algorithm 


Algorithm 1 Sufficient forecasting using factor models 


Step 1: Obtain the estimated factors from (2.10) and (2.11); 

Step 2: Gonstruct as in (2.12) or ( |2.13| ); 

Step 3: Obtain • • ■ , from the L largest eigenvectors of 

Step 4: Gonstruct the predictive indices • • • , and forecast yt+i- 


To make the forecasting procedure concrete, we elucidate how factors and factor loadings 
are estimated. We temporarily assume that the number of underlying factors K is known to 


























US. Consider the following constrained least squares problem: 


r''ll2 


(B;^,Fi^) =argmin ||X-BF||^, 

(B,F) 

subject to T“^F'F = Ik, 


B'B is diagonal, 


( 2 . 10 ) 

( 2 . 11 ) 


where X = (xi, • • • ,X(r), F' = (fi, • • ■ ,f(r), and || ■ \\f denotes the Frobenius norm. This is 
a classical principal components problem, and it has been widely used to extract underlying 


common factors (Stock and Watson, 2002a Bai and Ng, 2002 Fan et al., 2013 Bai and Ng 


2013). The constraints in (2.11) correspond to the normalization (2.3). The minimizers Yk 


and B/f are such that the columns of F^/\/T are the eigenvectors corresponding to the K 
largest eigenvalues of the T x T matrix X'X and 


Bk = T-'XF 


K- 


To simplify notation, we use B = Bk^ F = F^ and F' = (fi, • ■ ■ , fr) throughout this paper. 


To effectively estimate 'Ylf\y in (2.7) and (2.9), we shall replace the conditional expecta¬ 
tions E(Jt\yt+i € Ih) and E{x.t\yt+i € Ih) by their sample counterparts. Denote the ordered 
statistics of {{yt+iX)]t=i,...,T-i by {(yp+i), according to the values of y, where 

y( 2 ) < • • • < 2/(t) and we only use information up to time T. We divide the range of y into El 
slices, where H is typically hxed. Each of the hrst Ef — 1 slices contains the same number of 
observations c > 0, and the last slice may have less than c observations, which exerts little 
influence asymptotically. For ease of presentation, we introduce a double script (^hj) in which 
h = 1,... ,E[ refers to the slice number and j = 1,...,c is the index of an observation in a 
given slice. Thus, we can write {(l/p+i), f(p)}t=i,.,.,r-i as 

{iy{h,j)-i^{h,j)) ■ y(h,j) = y{c(h-l)+j+l)i = f{c{h-l)+j)}h=l,...,H-, j=l,...,c- 
Based on the estimated factors b, we have the estimate in the form of 


h=l 1=1 l=l 


( 2 . 12 ) 


Analogously, by using (2.9), we have an alternative estimator: 


1 Ad 




H 

h=l 1=1 


(2.13) 


i=l 
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where = (B'B) ^B is the plug-in estimate of A;,. Two estimates of are seemingly 
based on estimated factors and based on estimated loadings. When 


different: 


factors and loadings are estimated using ( 2 . 10 ) and ( 2 . 11 ), the following proposition shows 
that and are the same. 

Proposition 2.1. Let ft and B be estimated by solving the constrained least square problem 


in (2.10) and (2.11). Then, the two estimators (2.12) and (2.13) ofT^py are equivalent, i.e 


_ ^2 

^f\y - ^f\y 


Remark 2.1. There are alternative ways for estimating factors F and factor loadings B. 


For example, Forni et al. (2000, 2015) studied factor estimation based on projection, Liu 


et al. (2012), Xue and Zou (2012) and Fan et al. (2015b) employed rank-based methods for 


robust estimation of the covariance matrices, and Connor et al. (2012) applied a weighted 


additive nonparametric estimation procedure to estimate characteristic-based factor models. 
These methods do not necessarily lead to the equivalence of and as in Proposition 


f\y 


2.1 However, the proposed sufficient forecasting will benefit from using generalized dynamic 


factor models (Forni et al., 2000, 2015) or more robust factor models. This would go beyond 


the scope of this paper, and is left for future research. 


2.3 Sufficient forecasting with projected principal components 

Often in econometrics and hnancial applications, there may exist explanatory variables 


to model the loading matrix of factor models (Connor et ah, 2012; Fan et ah, 2015a). In 


such cases, the loading coefficient bj in (2.15) can be modeled by nonparametric functions 


bj = g(zi) -|- 7 j, where Zj denotes the vector of d subject-specihc covariates, and 7 ^ is the part 
of bj that can not be explained by the covariates z*. Suppose that 7 j’s are independently 
identically distributed with mean zero, and they are independent of Zj, uu, and et+i. 

Now we consider the following semi-parametric factor model to forecast yt+i'- 


yt+i — ■ ■ ■, ct+i), 

Xit = (g(zi) -F 7 j'fi -h Uit, l<i<p, l<t<T, 


(2.14) 

(2.15) 


where <pi,... ,0^, et+i, h, and uu follow the same assumptions in Section 2.1. Note that 


(2.15) has the matrix form of X = (G(Z) -I- r)F' -|- U, which decomposes the factor loading 
matrix B into the subject-specihc component G(Z) and the orthogonal residual component 


F. Following Fan et al. (2015a), the unobserved latent vectors h in the semi-parametric factor 
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model (2.15) can be better estimated by an order of magnitnde, as long as the contribntions 


of Zj are gennine. 


Given covariates Zj, we first nse the projected PGA (Fan et ah, 2015a) to estimate 


latent factors. Projected PGA is implemented by projecting the predictors X onto the 
sieve space spanned by {zi,... ,Zp} and applying the conventional PGA on the projected 
predictors X. As an example, we nse additive spline model to approximate the fnnc- 
tion g in (2.15). Let (pi, ■ ■ ■ , (pj is a set of J basis fnnctions. Then, we dehne c^(zj) = 
..., (pj{zii), ■ ■ • , (pi{zid ),..., (pj{zid))', and ip{Z) = (<^(zi), • ■ ■ , <^(zp))' as a p by Jd 
design matrix, created by htting additive spline models. Then, the projected predictors (or 
smoothed predictors by nsing additive model) is 


X = PX, 


(2.16) 


where P = ‘p(Z) (cp(Z)'cp(Z))~^ cp(Z)'X is the projection matrix onto the sieve space. The 
projected PGA is to rnn PGA on the projected matrix X to extract the latent factor F. The 
advantage of this is that F is estimated far more accnrately, according to 


Fan et ah 


(2015a). 


Note that the projection matrix P was constructed by using the additive spline model. It 
can also be constructed by using the multivariate kernel function. 

After obtaining F, we simply follow Section 2.2 to obtain SDR directions i/’i) ‘ ‘ 
and sufficient predictive indices • • • , to forecast forecast yt+i- 


2.4 Choices of tuning parameters 

Sufficient forecasting includes determination of the number of slices if, the number of 
predictive indices L, and the number of factors K. In practice, H has little influence on the 


estimated directions, which was hrst pointed out in Li (1991) and can be seen from (2.8). 


The choice of H differs from that of a smoothing parameter in nonparametric regression, 
which may lead to a slower rate of convergence. As shown in Section 3, we always have the 
same rate of convergence for 0^ no matter how H is specihed in the sufficient forecasting. 


The reason is that (2.8) gives the same eigenspace as long as H > max{L, 2}. 


In terms of the choice of L, the hrst L eigenvalues of must be signihcantly different 


from zero compared to the estimation error. Several methods such as Li (1991) and Schott 


(1994) have been proposed to determine L. For instance, the average of the smallest K — L 


eigenvalues of would follow a y^-distribution if the underlying factors are normally 
distributed, where K is the number of underlying factors. Should that average be large, 
there would be at least L + 1 predictive indices in the sufficient forecasting model. 

Pertaining to many factor analysis, the number of latent factors K might be unknown 
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to US. There are many existing approaches to determining K in the literature, e.g., Bai and 


Ng (2002), Hallin and Liska (2007), Alessi et ah (2010). Recently, Lam and Yao (2012) and 


Ahn and Horenstein (2013) proposed a ratio-based estimator by maximizing the ratio of two 


adjacent eigenvalues of X'X arranged in descending order, i.e. 


K = argmax A^/Ai+i, 

l<i<kmax 


where Ai > ■ ■ ■ > A^ are the eigenvalues. The estimator enjoys good hnite-sample perfor¬ 
mances and was motivated by the following observation: the K largest eigenvalues of X'X 


grow unboundedly as p increases, while the others remain bounded. Fan et ah (2015a) pro¬ 


posed a similar ratio-based estimator for estimating the number of factors in semiparametric 
factor models. We note here that once a consistent estimator of K is found, the asymptotic 
results in this paper hold true for the unknown K case by a conditioning argument. Unless 
otherwise specihed, we shall assume a known K in the sequel. 


3 Asymptotic properties 

We dehne some necessary notation here. For a vector v, ||v|| denotes its Euclidean norm. 
For a matrix M, ||M|| and ||M||i represent its spectral norm and norm respectively, 
dehned as the largest singular value of M and its maximum absolute column sum. ||M||oo 
is its matrix £00 norm dehned as the maximum absolute row sum. For a symmetric matrix 
M, ||M||i = ||M||oo. Denote by Amin(-) and Amax(-) the smallest and largest eigenvalue of a 
symmetric matrix respectively. 


3.1 Assumptions 


We hrst detail the assumptions on the forecasting model (2.1) and the associated fac¬ 
tor model (2.2), in which common factors {ft}t=i,...,r and the forecasting function h{-) are 
unobserved and only are observable. 


Assumption 3.1 (Factors and Loadings). For some M > 0, 

(1) The loadings bj satisfy that ||bj|| < M for i = 1,... ,p. As p ^ 00 , there exist two 
positive constants ci and C 2 such that 

Cl < Amin(-B'B) < Ainax(“B'B) < C2. 
p p 


(2) Identification: T ^F'F 


Ik, o,nd B'B is a diagonal matrix with distinct entries. 
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(3) Linearity: E(h'ft\<p[ft, ■ ■ ■ , <pL^t) is a linear function of ■ ■ ■ , 4>L^t for any b G 
where 4>^ ’s come from model (2.1^. 


Condition (1) is often known as the pervasive condition (Bai and Ng, 2002 Fan et al. 


2013) such that the factors impact a non-vanishing portion of the predictors. Condition (2) 


corresponds to the PCI condition in Bai and Ng (2013), which eliminates rotational indert- 


erminacy in the individual columns of F and B. Condition (3) ensures that the (centered) 
inverse regression curve E{ft\yt+i) is contained in the central subspace, and it is standard in 
the dimension reduction literature. The linearity condition is satisfied when the distribution 


of ft is elliptically symmetric (Eaton, 1983), and it is also asymptotically justified when the 


dimension of f* is large ( |Hall and Li 1993). 

We impose the strong mixing condition on the data generating process. Let and 
denote the a—algebras generated by {(f*, u^, ct+i) : f < 0} and {(ft, Uj, et+i) : t > T} 
respectively. Define the mixing coefficient 


a(T) = sup \P{A)P{B) - P{AB)\. 


Assumption 3.2 (Data generating process). {ft}i>i, and are three strictly 

stationary processes, and they are mutually independent. The factor process also satisfies that 
£'||fi||'^ < oo and E(||ft|p|j/t+i) < oo. In addition, the mixing coefficient a{T) < cp^ for all 
T E TA and some p E (0,1). 


The independence between and (or {et+i}i>i) is similar to Assumption 

A(d) in Bai and Ng (2013). The independence between and {et+i}t>i, however, can 

be relaxed, making the data generation process more realistic. For example, we just need to 
assume that E{ut\yt+i) = 0 for f > 1 for the subsequent theories to hold. Nonetheless, we 
stick to this simplified assumption for clarity. 

Moreover, we make the following assumption on the residuals and dependence of the 
factor model (2.2). Such conditions are similar to those in Bai (2003), which are needed to 
consistently estimate the common factors as well as the factor loadings. 


Assumption 3.3 (Residuals and Dependence). There exists a positive constant M < oo 
that does not depend on p and T, such that 

(1) E(ut) = 0, and E\uitf < M. 

(2) ||S«||i < M, and for every i,j,t,s > 0, {pT)~^Y.i,j,t,s \E{uitUjs)\ < M 

(3) For every {t,s), E|p“^/^(u(ui — E{usUt))\‘^ < M. 


13 































3.2 Rate of convergence 


The following theorem establishes the convergence rate of the sliced covariance estimate 
of inverse regression curve (i.e. as in (2.12) or ( 2.13[ )) under the spectral norm. It 

further implies the same convergence rate of the estimated SDR directions associated with 
sufficient predictive indices. For simplicity, we assume that K and H are hxed, though they 
can be extended to allow K and H to depend on T. This assumption enables us to obtain 
faster rate of convergence. Note that the regression indices and the PCA directions 

{'^i}i=i ^'I’e typically different, but they can span the same linear space. 


Theorem 3.1. Suppose that Assumptions 3.1 3.3 hold and let uip^T = P + T Then 


under the forecasting model (2.1) and the associated factor model (2.2), we have 


l^/h ^/lyll Opiojp^T)- 


(3.1) 


If the L largest eigenvalues oflllpy are positive and distinct, the eigenvectors 
associated with the L largest eigenvalues o/Sj|y give the consistent estimate of directions 
'ipi, - ■ ■ , 'ipL respectively, with rates 


Wi - V’jll = Op(Wp,T). 


(3.2) 


for j = 1, - ■ ■ ,L, where i/^i, • • ■ , form an orthonormal basis for the central subspace Sy\f 
spanned by - ■ ■ ,<pp^. 


When L = 1 and the link function h(-) in model (2.1) is linear. Stock and Watson (2002a) 


concluded that having estimated factors as regressors in the linear forecasting does not affect 


consistency of the parameter estimates in a forecasting equation. Theorem 3T extends the 
theory, showing that it is also true for any unknown link function h{-) and multiple predictive 
indices by using the sufficient forecasting. The two estimates coincide in the linear forecasting 
case, but our procedure can also deal with the nonlinear forecasting cases. 


Remark 3.1. Combining Theorem 3.1 and Fan et ah (2015a), we could obtain the rate of 


convergence for sufficient forecasting using semiparametric factor models (i.e., (2.14)-(2.15)). 
Under some conditions, when J = (pmin{T, p, and p,T —)■ cx), we have 

W^fly - '^flyW = Op{p~^^^ ■ {pmm{T,p,v-^})^-^ 


where k > 4 is a constant characterizing the accuracy of sieve approximation (see Assumption 
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4.3 of 


Fan et al. 


(2015a)), and Vp = max^ - ■ Then, for j = 1, • • • , L, we have 


- tpAl = Op{p ■ (pmin{T,p,n/}) 2 K h+T 


where • • ■ , i/’l form an orthonormal basis for F^^if, when the L largest eigenvalnes of 
are positive and distinct. It is obvious that (pmin{T,p, 0 even when T is hnite. 

Thus, with covariates Z, the sufficient forecasting achieves a faster rate of convergence by 


using the projected PCA (Fan et ah, 2015a). For space consideration, we do not present 
technical details in this paper. 


As a consequence of Theorem 3.1, the sufficient predictive indices can be consistently 
estimated as well. We present this result in the following corollary. 

Corollary 3.1. Under the same conditions of Theorem EH for any j = 1,..., L, we have 


-^p (3.3) 

Translating into the original predictors and letting we then have 

^'•xt ^p (3.4) 


Corollary 3.1 further implies the consistency of our proposed sufficient forecasting by 


using the classical nonparametric estimation techniques (Green and Silverman, 1993 Fan 
and Gijbels, 1996 [LTand Racine , 2007). 


Remark 3.2. In view of (3.3) and (3.4) in Gorollary 3.1, the sufficient forecasting hnds 

not only the sufficient dimension reduction directions 'lAi,--- , estimated factors ft 

but also the sufficient dimension reduction directions ^i, • • ■ predictor x^. The suf- 

hcient forecasting effectively obtains the projection of the p-dimensional predictor x* onto 

■ —■/ 

the L-dimensional subspace ^iXt, • • ■ by using the factor model, where the number of 

predictors p can be larger than the number of observations T. It is well-known that the tra¬ 
ditional sliced inverse regression and many other dimension reduction methods are limited to 


either a fixed dimension or a diverging dimension that is smaller than the sample size (Zhu 


et al., 2006). By condensing the cross-sectional information into indices, our work alleviates 


what plagues the sufficient dimension reduction in high-dimensional regimes, and the factor 


structure in (2.1) and (2.2) effectively reduces the dimension of predictors and extends the 


methodology and applicability of the sufficient dimension reduction. 
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3.3 Linear forecasting under link violation 

Having extracted common factors from the vast predictors, the PCR is a simple and 
natural method to run a multiple regression and forecast the outcome. Such a linear esti¬ 
mate is easy to construct and provides a benchmark for our analysis. When the underlying 
relationship between the response and the latent factors is nonlinear, directly applying linear 


forecasts would violate the link function h{-) in (2.2). But what does this method do in the 
case of model misspecihcation? In what follows, we shall see that asymptotically the linear 
forecast falls into the central subspace Sy\f, namely, it is a weighted average of the sufficient 
predictive indices. 


With the underlying factors {ft} estimated as in (2.10) and (2.11), the response j/t+i is 
then regressed on ft via a multiple regression. In this case, the design matrix is orthogonal due 


to the normalization (2.11). Therefore, the least-squares estimate of regression coefficients 


from regressing yt+i on ft is 


T-l 


4 > = 


T-l 




(3.5) 


t=i 


To study the behavior of regression coefficient 0, we shall assume normality of the un¬ 
derlying factors ft. The following theorem shows that, regardless of the specihcation of the 
link function h(-), 0 falls into the central subspace S'^if spanned by 0^, • • • ,(f)i as p,T —)■ oo. 
More specihcally, it converges to the sufficient direction 


0 = 


2=1 


which is a weighted average of sufficient directions 0i, • • ■ , 4>l- In particular, when the link 
function is linear and L = 1, the PCR yields an asymptotically consistent direction. 


Theorem 3.2. Consider model (2.1) and (2.2) under assumptions of Theorem 3.1. Suppose 
the factors f* are normally distributed and that E{ii() < oo. Then, we have 


<l>\\ = 


(3.6) 


As a consequence of Theorem 3.2, the linear forecast 0 f^ is a weighted average of the 


sufficient predictive indices 0}ft, • ■ ■ , 4>L^t- 


Corollary 3.2. Under the same conditions of Theorem 3^. we have 

L 

$'fi = T. 


(3.7) 


2=1 
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It is interesting to see that when L = 1, 0 = E As a result, the least 
squares coefficient 0 in (3.5), using the estimated factors as covariates, delivers an asymp¬ 
totically consistent estimate of the forecasting direction 0^ up to a multiplicative scalar. In 
this case, we can hrst get a scatter plot for the data {(0 f*, yt+i)}t=i check if the linear £t 
suffices and then employ a nonparametric smoothing to get a better £t if necessary. This can 
correct the bias of PCR, which delivers exactly one index of extracted factors for forecasting. 

When there exist multiple sufficient predictive indices (i.e., L >2), the above robustness 
no longer holds. The estimated coefficient 0 belongs to the central subspace, and it only 
reveals one dimension of the central subspace leading to limited predictive power. The 
central subspace S'y|f, in contrast, is entirely contained in as in (2.7). By estimating 
Sjiy, the sufficient forecasting tries to recover all the effective directions and would therefore 
capture more driving forces to forecast. 

As will be demonstrated in Section 4, the proposed method is comparable to the lin¬ 
ear forecasting when L = 1, but signihcantly outperforms the latter when L > 2 in both 
simulation studies and empirical applications. 


4 Numerical studies 

In this section, we conduct both Monte Carlo experiments and an empirical study to 
numerically assess the proposed sufficient forecasting. Sections 4.1-4.3 include three different 
simulation studies, and Section 4.4 shows the empirical study. 


4.1 Linear forecasting 


We hrst consider the case when the target is a linear function of the latent factors plus 
some noise. Prominent examples include asset return predictability, where we could use the 


cross section of book-to-market ratios to forecast aggregate market returns (Campbell and 


Shiller, 1988 Polk et ah, 2006 Kelly and Pruitt, 2013). To this end, we specify our data 


generating process as 


Vt+l — 0 ft + 

^it bift T 

where we let 77 = 5 and 0 = (0.8, 0.5, 0.3, 0, 0)', namely there are 5 common factors driving 
the predictors and a linear combination of them predicting the outcome. Factor loadings 
bj’s are drawn from the standard normal distribution. To account for serial correlation, we 
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set fjt and as two AR(1) processes respectively, 

fjt A CjO Uit PiUit—\ “1“ l^it- 

We draw aj,pi from ~ f/[0.2,0.8] and fix them during simulations, while the noises ejtjh'u 
and e^+i are standard normal respectively, ay is taken as the variance of the factors, so that 
the infeasible best forecast using <p'it has an of approximately 50%. 


Table 1: Performance of forecasts using in-sample and out-sample 


p 

T 

In-sample R? 
SF(1) PCR 

(%) 

PCI 

Out-of-sample R? 
SF(1) PCR 

(%) 

PCI 

50 

100 

46.9 

47.7 

7.8 

35.1 

39.5 

2.4 

50 

200 

46.3 

46.5 

6.6 

42.3 

41.7 

4.4 

100 

100 

49.3 

50.1 

8.9 

37.6 

40.3 

3.0 

100 

500 

47.8 

47.8 

5.5 

43.6 

43.5 

1.1 

500 

100 

48.5 

48.8 

7.9 

40.0 

43.1 

4.7 

500 

500 

48.2 

48.3 

7.2 

48.0 

47.9 

6.0 


Notes: In-sample and out-of-sample median R^, recorded in percentage, over 
1000 simulations. SF(1) denotes the sufficient forecast using one single predic¬ 
tive index, PCR denotes principal component regression, and PCI uses only 
the first principal component. 


We examine both the in-sample and out-of-sample performance between the principal 
component regression and the proposed sufficient forecasting. Akin to the in-sample R^ typ¬ 


ically used in linear regression, the out-of-sample R^, suggested by Campbell and Thompson 

El 


(2008), is defined here as 


-^os ~ 1 


--[T/2] 


{yt - yt) 


Elp 


(4.1) 


=[T/2]iyt y^y 

where in each replication we use the second half of the time series as the testing sample, yt 
is the fitted value from the predictive regression using all information before t and yt is the 
mean from the test sample. Note that Rq^ can be negative. 

Numerical results are summarized in Table where SF(1) denotes the sufficient forecast 
using only one single predictive index, PCR denotes the principal component regression, and 
PCI uses only the first principal component. Note that the first principal component is not 
necessarily the same as the sufficient regression direction cj). Hence, PCI can perform poorly. 
As shown in Table the sufficient forecasting SF(1), by hnding one single projection of the 
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latent factors, yields comparable results as PCR, which has a total of five regressors. The 


good performance of SF(1) is consistent with Theorem 3.1, whereas the good performance 


of PCR is due to the fact that the true predictive model is linear (see Stock and Watson 


(2002a) and Theorem 3.2). In contrast, using the first principal component alone has very 
poor performance in general, as it is not the index for the regression target. 


4.2 Forecasting with factor interaction 

We next turn to the case when the interaction between factors is present. Interaction 
models have been employed by many researchers in both economics and statistics. For 


example, in an influential work, Rajan and Zingales (1998) examined the interaction between 


hnancial dependence and economic growth. Consider the model 


Ut+l — /lt(/2i + /si + 1) + 


where et+i is taken to be standard normal. The data generating process for the predictors 
Xits is set to be the same as that in the previous simulation, but we let K=7, i.e., 7 factors 
drive the predictors with only the hrst 3 factors associated with the response (in a nonlinear 
fashion). The true sufficient directions are the vectors in the plane Sf\y generated by 0^ = 
(1, 0, 0, 0,0, 0, 0)' and 02 = (0) 1) 1) O 5 O 5 O 5 0)'/-\/2. Following the procedure in Algorithm 
we can hnd the estimated factors f* as well as the forecasting directions 0^ and 02 . 

The forecasting model only involves a strict subset of the underlying factors. Had we 
made a linear forecast, all the estimated factors would likely be selected, resulting in issues 
such as use of irrelevant factors (i.e., only the hrst three factors suffice, but the hrst 7 factors 
are used). This is evident through the correlations between the response and the estimated 
factors, as in Table and results in both bias (htting a wrong model) and variance issue. 


Table 2: |Corr(|/t+i, fit)| 


p 

T 

fii 

f2i 


f4t 

fst 


fit 

500 

500 

0.1819 

0.1877 

0.1832 

0.1700 

0.1737 

0.1712 

0.1663 


Notes: Median correlation between the forecast target and the estimated 
factors in absolute values, based on 1000 replications. 

Next, we compare the numerical performance between the principal component regression 
and the proposed sufficient forecasting. As in Section 4.1, we again examine their in-sample 
and out-of-sample forecasting performances using the in-sample and the out-of-sample 
respectively. Furthermore, we will measure the distance between any estimated regression 
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direction (j) and the central subspace Sf\y spanned by (pi and 02 - W^e ensure that the true 
factors and loadings meet the identihability conditions by calculating an invertible matrix 
H such that ^HF'FH' = Ik and is diagonal. The space we are looking for is 

then spanned by and notation simplicity, we still represent the rotated 

forecasting space as Sf\y. We employ the squared multiple correlation coefficient R‘^{4>) to 
evaluate the effectiveness of 0, where 


/? 2 ( 0 ) = 


max 


0e5y|,„||0li=i 


(0 0 )^. 


This measure is commonly used in the dimension reduction literature; see Li (1991). We 


calculate R‘^{<p) for the sufficient forecasting directions (pi, 02 and the PCR direction 0 


per • 


Table 3: Performance of estimated index 0 using R‘^{4>) (%) 


Sufficient forecasting PCR 


p 

T 

rHP,) 

R'(02) 

R^iPper) 

100 

100 

84.5 (16.3) 

64.4 (25.0) 

91.4 (9.1) 

100 

200 

92.9 (7.8) 

83.0 (17.5) 

95.7 (4.6) 

100 

500 

96.9 (2.2) 

94.1 (5.0) 

96.7 (1.6) 

500 

100 

85.4 (15.9) 

57.9 (24.8) 

91.9 (8.8) 

500 

200 

92.9 (6.7) 

82.9 (16.1) 

95.9 (4.1) 

500 

500 

97.0 (2.0) 

94.5 (4.5) 

98.2 (1.5) 


Notes: Median squared multiple correlation coefficients in percentage over 
1000 replications. The values in parentheses are the associated standard 
deviations. PCR denotes principal component regression. 
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Table 4: Performance of forecasts using in-sample and out-sample B? 


In-sample (%) Out-of-sample (%) 


p 

T 

SFi 

PCR 

PCRi 

SFi 

PCR 

PCRi 

100 

100 

46.2 

38.5 

42.4 

20.8 

12.7 

13.5 

100 

200 

57.7 

35.1 

38.6 

41.6 

24.0 

24.7 

100 

500 

77.0 

31.9 

34.9 

69.7 

29.1 

31.5 

500 

100 

49.5 

38.7 

22.7 

21.9 

16.6 

19.2 

500 

200 

58.9 

34.7 

39.0 

40.2 

22.2 

24.0 

500 

500 

79.8 

32.5 

35.6 

72.3 

26.9 

28.2 


Notes: In-sample and out-of-sample median R^ in percentage over 1000 repli¬ 
cations. SFi uses first two predictive indices and includes their interaction 
effect; PCR uses all principal components; PCRi extends PCR by including 
an extra interaction term built on the first two principal components of the 
covariance matrix of predictors. 


In the sequel, we draw two conclusions from the simulation results summarized in Tables 
and [phased on 1000 independent replications. 

Firstly, we observe from Table that the length of time-series T has an important effect 
on and As T increases from 100 to 500, all the squared multiple 

correlation coefficients increase too. In general, the PCR and the proposed sufficient fore¬ 
casting perform very well in terms of the measure The good performance of R^{cj)p^) 

is a testimony of the correctness of Theorem 3^ However, the PCR is limited to the linear 
forecasting function, and it only picks up one effective sufficient direction. The sufficient fore¬ 
casting successfully picks up both effective sufficient directions in the forecasting equation, 
which is consistent to Theorem 13.11 

Secondly, we evaluate the power of sufficient forecasting using the two predictive indices 


(piU and 02^0 Similar to the previous section, we build a simple linear regression to forecast, 
but include three terms, ^^fi, 02^2 and (0^fi) • ( 02 ^ 2 ), as regressors to account for interaction 
effects. In-sample and out-of-sample R? are reported. For comparison purposes, we also 
report results from principal component regression (PCR), which uses the 7 estimated factors 
extracted from predictors. In addition, we add the interaction between the first two principal 
components on top of the PCR, and denote this method as PCRi. Note that the principal 
component directions differ from the sufficient forecast directions, and it does not take into 
account the target information. Therefore, PCRi model is very different from the sufficient 
regression with interaction terms. As shown in Table the in-sample R^’s of PCR hover 
around 35%, and its out-of-sample R^’s are relatively low. Including interaction between 
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the first two PCs of the covariance of the predictors does not help mnch, as the interaction 
term is incorrectly specihed for this model. SFi picks np the correct form of interaction and 
exhibit better performance, especially when T gets reasonably large. 

4.3 Forecasting using semiparametric factor models 

When the factor loadings admit a semiparametric strnctnre, this additional information 
often leads to improved estimation accnracy. In this section, we consider a simple design 
with one observed covariate and three latent factors. We set the three loading fnnctions 
as gi{z) = z, g 2 {z) = z'^ — 1 and gs^z) = z^ — 2z, where the characteristic 2 ; is drawn 
from standard normal. We examine the same predictive model as ontlined in Section 4.2, 
which involves factor interaction. The only difference now is that onr loadings are covariate- 
dependent and we no longer have irrelevant factors. 


Out-of-sample ( T = 100) Out-of-sample (T = 200 ) 




Out-of-sample R^ ( T = 300 ) Out-of-sample R^ (T = 500 ) 




Fignre 2: Out-of-sample over 1000 replications. SF denotes the sufficient forecasting method 
without using extra covariate information. SF-projPCA uses projected-PCA method to estimate 
the latent factors, and SF-knownF uses the true factors from the data generating process. All the 
methods are based on 2 predictive indices. 
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Figure 2 presents the out-of-sample performance of three methods for different combina¬ 
tions of {p,T). The three methods differ in their ways of estimating the latent factors. In 
the hrst method, we simply use traditional PCA to extract underlying factors, whereas in 
the second method we resort to projected-PCA. For comparison, we also look at the per¬ 
formance when the true factors are used to forecast the target. In general, as T increases, 
the predictive power signihcantly increases, irrespective of the cross-sectional dimension p. 
For hxed T, the projected-PCA quickly picks up predictive power as p increases. Using 
extra covariate information helps up gain accuracy in estimating the underlying factors, and 
lifts up the out-of-sample very close to those obtained from using the true factors. This 
further demonstrates that with the presence of known characteristics in factor loadings, the 
projected-PCA would buy us more power in forecasting the factor-driven target. 


4.4 An empirical example 

As an empirical investigation, we use sufficient forecasting to forecast several macroe¬ 


conomic variables. Our dataset is taken from Stock and Watson (2012), which consists of 
quarterly observations on 108 U.S. low-level disaggregated macroeconomic time series from 
1959:1 to 2008:IV. Similar datasets have been extensively studied in the forecasting literature 


(Bai and Ng, 2008; Ludvigson and Ng, 2009). These time series are transformed by taking 
logarithms and/or differencing to make them approximately stationary. We treat each of 
them as a forecast target yt, with all the others forming the predictor set x*. The out- 
of-sample performance of each time series is examined, measured by the out-of-sample 


in (4.1). The procedure involves fully recursive factor estimation and parameter estimation 
starting half-way of the sample, using data from the beginning of the sample through quarter 
t for forecasting in quarter t + 1. 

We present the forecasting results for a few representatives in each macroeconomic cat¬ 
egory. The time series are chosen such that the second eigenvalue of 'Sf\y exceeds 50% of 
the hrst eigenvalue in the hrst half training sample. This implies that a single predictive 
index may not be enough to forecast the given target, so we could consider the ehect of 
second predictive index. Note that in some categories, the second eigenvalues 'Sf\y for all the 
time series are very small; in this case, we randomly pick a representative. As described in 
previous simulations, we use SF(1) to denote sufficient forecasting with one single predictive 
index, which corresponds to a linear forecasting equation with one regressor. SF(2) employs 
an additional predictive index on top of SF(1), and is ht by local linear regression. In all 
cases, we use 7 estimated factors extracted from 107 predictors, which are found to explain 


over 40 percent of the variation in the data; see Bai and Ng (2013). 


23 
















Table 5: Out-of-sample Macroeconomic Forecasting 


Category 

Label 

SF(1) 

SF(2) 

PCR 

PCI 

GDP components 

GDP264 

11.2 

14.9 

13.8 

9.2 

IP 

IPS13 

17.8 

21.0 

20.6 

-2.9 

Employment 

CES048 

24.1 

26.4 

23.4 

25.7 

Unemployment rate 

LHU680 

19.0 

21.7 

13.6 

21.1 

Housing 

HSSOU 

1.9 

4.4 

-0.8 

7.4 

Inventories 

PMNO 

19.8 

20.2 

18.6 

6.3 

Prices 

GDP275_3 

9.1 

11.3 

10.2 

0.2 

Wages 

LBMNU 

32.7 

34.2 

34.0 

29.1 

Interest rates 

EYEF 

3.6 

2.4 

-31.5 

1.6 

Money 

CCINRV 

4.8 

16.0 

1.3 

-0.5 

Exchange rates 

EXRCAN 

-4.7 

4.5 

-7.9 

-1.3 

Stock prices 

ESDJ 

-9.2 

8.8 

-14.5 

-7.7 

Consumer expectations 

HHSNTN 

-3.7 

-5.0 

-4.4 

0.0 


Notes: Out-of-sample for one-quarter ahead forecasts. SF(1) uses a single 
predictive index built on 7 estimated factors to make a linear forecast. SF(2) is 
fit by local linear regression using the first two predictive indices. PCR uses all 7 
principal components as linear regressors. PCI uses the first principal component. 


Table|^shows that SF(1) yields comparable performance as PCR. There are cases where 
SF(1) exhibits more predictability than PCR, e.g., Inventories and Interest rates. This is due 
to the fact that there are possible nonlinear effects from extracted factors on the target. In 
this case, the hrst predictive index obtained from our procedure accounts for such a nonlinear 
effect whereas the index created by PCR hnds only the best linear approximation and can not 
hnd a second index. SF(2) improves predictability in many cases, where the target time series 
might not be linear in the latent factors. Taking CCINRV (consumer credit outstanding) for 
example. Figure 3 plots the eigenvalues of its corresponding 'Sf\y, the estimated regression 
surface and the running out-of-sample R^’s by different sample split date. As can been seen 
from the plot, there is a non-linear effect of the two underlying macro factors on the target. 
By taking such effect into account, SF(2) consistently outperforms the other methods. 
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Eigenvalue 


Estimated regression surface 



Running out-of-sample 



Figure 3: Forecasting results for CCINRV (consumer credit outstanding). The top left panel shows 
the eigenvalues of "^fiy The top right panel gives a 3-d plot of the estimated regression surface. 
The lower panel displays the running out-of-sample R^’s for four methods described in Table 
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5 Discussion 


We have introduced the sufficient forecasting in a high-dimensional environment to fore¬ 
cast a single time series, which enlarges the scope of traditional factor forecasting. The key 
feature of the sufficient forecasting is its ability in extracting multiple predictive indices and 
providing additional predictive power beyond the simple linear forecasting, especially when 
the target is an unknown nonlinear function of underlying factors. We explicitly point out 
the interesting connection between the proposed sufficient forecasting and the deep learning 
(Bengio, 2009 Bengio et ah, 2013). Furthermore, the proposed method extends the sufficient 
dimension reduction to high-dimensional regimes by condensing cross-sectional information 
through (semiparametric) factor models. We have demonstrated its efficacy through Monte 
Carlo experiments. Our empirical results on macroeconomic forecasting suggest that such 
procedure can contribute to substantial improvement beyond conventional linear models. 
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6 Appendix 


We first cite two lemmas from Fan et al. (2013), which are needed subsequently in the proofs. 


Lemma 6.1. Suppose A and B are two semi-positive definite matrices, and that Amin (A) > Cp,T 
for some sequence Cp^T >0. //||A — B||= Op{cp^T), then 

||A-i-B-i=Op(c-|)||A-B||. 


Lemma 6.2. Let be the eigenvalues ofS in descending order and be their associated 

eigenvectors. Correspondingly, let he the eigenvalues of S in descending order and 

he their associated eigenvectors. Then, 


(a) Weyl’s Theorem: 

— Ail < ||5] — S 
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(b) sm(0) Theorem (Davis and Kahan, 1970): 




\/2 • IIS - SI 


min(|Ai_i - Aj|, |Aj - Ai+i|) 


6.1 


Proof of Proposition 


2.1 


Proof. It suffices to show that = Af,xt, or equivalently, F' = A^X in matrix form. First we let 
M = diag(Ai, • • • , Xr), where Aj are the largest K eigenvalues of X'X. By construction, we have 
(X'X)F = FM. Thus, 


B'B = T"2f'(X'X)F = T-^F'FM = T'^M. 


Now, it is easy to see that 

(B'B)"^B'X = rM"^(r"^F'X')X = M-i(X'XF)' = F'. 


This concludes the proof of Proposition 2.1 


□ 


6.2 Proof of Theorem 3.1 


Let V denote the K x K diagonal matrix consisting of the K largest eigenvalues of the sample 
covariance matrix T^^X'X in the descending order. Define a. K x K matrix 


H = (l/r)ViF'FB'B, 


( 6 . 1 ) 


where F' = (fi, • • • , fp). In factor models, the rotation matrix H is typically used for identifiability. 
With identification in Assumptions 3.1 (2), Bai and Ng (2013) showed that H is asymptotically an 
identity matrix. The result continues to hold under our setting. 


Lemma 6.3. Under Assumptions 3.1 (l)-(2), \3.1^ and \3.1^ we have 

\\11-Ik\\=Op{u:It) 


Proof. In view of Assumption A and Appendix B of Bai and Ng (2013), we only need to verify that 
Assumption A (c.ii) in Bai and Ng (2013) is satisfied under our models and assumptions, since all 
other assumptions of theirs are met here. To verity Assumption A (c.ii), it is enough to show that 
Ylt=i \E{uitUjs)\ < Ml and \E{uitUjs)\ < M 2 for some positive constants Mi, M 2 < 00 . 

Consider E{uitUjs). Since E\uit\^ < M for some constant M and any t by Assumption |3 . 3| ( 1), 
by using Davydov’s inequality (Corollary 16.2.4 in Athreya and Lahiri (2006)), there exist constants 
Cl and C 2 such that for all (i, j), 

\E{uitUjs)\ < Cl {a{\t - s|))^“8“8 < C'2/93l‘-'^l, 


where «(•) is the mixing coefficient, p G (0,1), and a (|t — s|) < C/ol* ®l by Assumption 3.2 Hence, 


with Ml = 2(172/(1 — p'i), we have 

T 


J]|F(u,in,,)| = J]C2p4l‘-l <2C2/(l-p3) = Mi. 


t=i 


t=i 
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In addition, it is easy to verify that there is a constant C 3 such that 

\E{uitUjs)\ < C3\E{uitUjt)\ = C'3|S„(i, j)|. 


By assumption, we also have 

p 


\E{uitUjs)\ < Cs \'Su{iJ)\ < CsIlS^lli < CsM. 


2=1 


2=1 


Thus, Assumption A (c.ii) in Bai and Ng (2013) is satisfied with M 2 = C^M. Therefore, with 
identification in Assumption |3.1[ 2), we have 

||H ~ lii'll = Op(^p,r)) 


which completes the proof of Lemma 6.3 


□ 

The next lemma shows that the normalization matrix A^ can be consistently estimated b y th e 


plug-in estimate Ah under the spectral norm, which is an important result to prove Theorem 


3.1 


Lemma 6.4. Under Assumptions 3.1 (l)-(2), |3..2| and 3.3. we have 
(a) ||B - B|| = 

(h) ||Ab - Aftll = Op{p~^/‘^ujp;T)- 
Proof, (a) First of all, 

||B-B|| = ||B-BH'-f B(H'-Ii^)|| 

< ||B-BH'|| + ||B(H'-Ii^)||, 

where H is defined in ( |6.1[) . The second term is bounded by ||B|| • ||H' — Ii^|| = = 

Opip^^‘^(jJp,T) by Lemma |6.3| and using the fact ||B IP — Amax(B'B) < C 2 P from Assumption 3.1. It 
remains to bound the first term. Using the fact that || • || < || • \\f, we have 


|B - BH'lp < ||B - BH'lll = ^ ||bi - Hb^ 


( 6 . 2 ) 


2 = 1 


Next, note that both bj = and hold as a result of the 

constrained least-squares problem (|2.10) and (2.11). Then, we have 


b,;-Hb. = 4 


T T 

yy(bxit - Hbxit) + y^ Hfi(fj'bi Uit) - Hb 


t=i 

T 


t=i 


= (b-3) 

t=i t=i 

where we have used the fact that ^ ~^K = 0. The remaining two terms on the right-hand 

side can be separately bounded as follows. 













Under Assumptions [37L] (1)-(2), 3.2 and 3.3, we have the following rate of convergence of the 
estimated factors, 




t=i 


This result is proved in Theorem 1 of Bai and Ng 


([ 200 ^. 


For the first term in (6.3), it is easy to see that = 0(1) and thus T ^ Ylt=i ~ Op{l)- 

Hence, we use the Cauchy-Schwarz inequality to yield 

IIi xuil - Hfi)|| < (^ E = Opiojp^T). 

t=i t=i t=i 

For the second term, by using the Cauchy-Schwarz inequality again, we have 

T T 

t=i t=i 

To bound ||bj — Hb^lp, we use the simple fact that (a + 6)^ < 2(a^ + b^). Therefore we combine 
two upper bonds to obtain that 

||B-BH'f < + 


i=l 

P 


t=l 


t=l 

T 


- ^ E + OpW • 




To 


bound X]f=i 11:^ J2t=i pointed out in the proof of 


Lemma 4 of 


Bai and Ng 


(2002), we 


only need to prove that E(^ J2i=i 11:^ J2t=i ^tUaW^) < Mi for some positive constant Mi. Next, 
we use the Cauchy-Schwarz inequality and the independence between {fi}t>i and {ut}t>i to obtain 


1 A 1 ^ 


fiiiitlP) = E{uisUit)E{^'^^t. 

^ i=i ^ t=i i' ■ - 


P i=l S=1 t=l 

< M -C {= Ml) 


i=l s=l t=l 
p T T 


3.2 


and ^ 'ZLi E s=i Et=i \E{uisUit)\ < M holds 


where Fl||fs|p < C holds due to Assumption 
due to Assumption |3. 31 (2). Then, we can follow Bai and Ng (2002) to show that 

p ^ T 


E II r?p ~ Opjp)- 

i=l t=l 


Now, it immediately follows from (6.2) that 

||B - BH'f = Opipujlr) + Op{p/T) = Op{pojIt)- 
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(b) Begin by noting that 

||B|| < ||B - B|| + ||B|| = Op{p"2i0p,T) + Opip^/^) = Op{p^/^). 

We use the triangle inequality to bound 

||B'B-B'B|| < ||B'B-B'B|| + ||B'B-B'B|| 

< ||B - B|| • ||B|| + IIB'II • ||B - B|| 

Note that ||B'B — B'B|| = Op{pujp^T) = Op{p) and Aniin(B'B) > cip hold by Assumption 3.1. Now, 
we use Lemma lb. 11 to obtain 


||(B'B)-i - (B'B)-i|| = Opip-^) ■ IIB'B - B'BII = Op{p-^Up,T). 

It follows that 

IIAb-Afell = ||(B'B)-iB'-(B'B)-iB'|| 

= ||(B'B)^^B' - (B'B)^^B'II + ||(B'B)“^B' - (B'B)^1b'| 
< ||(B'B)~^ - (B'B)~i|| • ||B|| + ||(B'B)^i|| • ||B' - B'|| 

= Op(p“^Wp,r) • Op{p^/^) + Op{p~^) ■ Opip^/^Up^x) 

= Op(p“^/^Wp,T)- 


which completes the proof of Lemma 6.4 


□ 


The following lemma lays the foundation of inverse regression, which can be found in Li (1991). 


Lemma 6.5. Under model (2.1) and Assumption 3.1 (3), the centered inverse regression curve 
E{{t\yt+i) — E{it) is contained in the linear subspace spanned by 0'^,cov(fi), A; = 1,..., L. 


We are now ready to complete the proof of Theorem 3.1 


Proof of Theorem 5^, In the first part, we derive the probability bound for ||Sj|y — 5]j|y||. Let 
rn/i = I Yl'i=i ^{h,i) denote the average of the predictors within the corresponding slice Ih, and uih = 
E{-xt\yt+i G Ih) represents its population version. By definition, ^^^(Abm/j)(A;,m/j)'. 

For fixed H, note that 


H 


= iL ^ y~][(Abmfe)(Afcmfe)^ - (Afem/i)(Afem/i)'] 


h=l 


Thus, 


H 

\'Ef\y - Ejiyll < ^ (^||Abm/i - Afem/ill • ||Afem/i|| + ||Abm/i|| • ||Abfn/i - Afem/i||^ (6.4) 


h=l 


We now bound each of the above terms. 
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From the definition, we immediately have 




1 

“ ~ ^E{it\yt+i G /h)|| 


1=1 


C C 

< l|B|| • l|- ^f(M) “ Ei^t\yt+1 G 4)11 + l|- 


{h,l)\ 


1=1 


1=1 


Under Assumption 3.2, the sample mean ^ X))=i ^ih,i) converges to the population mean E{it\yt+i G 
4) at the rate of Op(T“^4). This holds true as the random variable ft\yt+i G Ih is still stationary 
with hnite second moments, and the sum of the a-mixing coefficients converges (Theorem 19.2 in 


Billingsley (1999)). This applies to Ut\yt+i G 4 as well. Therefore, 

||m/j - m/j|| = Op{p^) ■ Op{T~^)+ Op{T~^) = Op{^/pJf). 

In addition to the inequality above, we also have 

I|m;,|| = \\E{^t\yt+i G 4)|| < ||B|| • \\E{it\yt+i G 4)|| = Op{p^/^). 

Thus, it follows that 

||m/i|| < ||mfe - m/j|| + \\mh\\ = Op{p^^^). 

Now, we use both the triangle inequality and the Cauchy-Schwarz inequality to obtain that 

IIAftin/i - Afem,i|| < || Afefn/j - Aftfh/jH + || Afein/j - Aftm/jH 

< IIAfc - Afcll • ||m;,|| + IIAfell • ||m,, - m/j|| 

= Op{p-^/^cvp,T) • Op{p^/^) + Op{p-^/^) • Op{^) 

— Op{ujp^T\ 


where we use ||Ab — HAft|| = Op{p~^/‘^ujp^T) by Lemma 
||Ab|| < ||(B'B)^^|| • ||B|| = Op{p~^/‘^) in the third equality. 
Also we have that 


6.4 


in the second inequality, and that 


IIAbnifell < {\\Ab - Abll + ||Afc| 




= Op{p ^/2) • Op(p^/^) = Op{l) 


and 


||Afem/j|| < IIAftll • ||m/j|| = Op{l). 


Hence, from (6.4), we conclude the desired result as follows: 

H 


l^/ly “ ^/|yll = H ^'^{Opiujp^T) ■ Op{l)+ Op{l) ■ Opiujp^r)) = Opiup^T)- 


h=l 


In the second part, we show that i/)^,..., are the desired SDR directions and further derive 
the probability bound for \\xj)j — -0^11 with j = 1,L. Let {Xi}fLi be the eigenvalues of in 
the descending order and be the eigenvalues of in the descending order. Let 0^ be 

the eigenvector associated with the eigenvalue Aj, and 0^ is the eigenvector associated with the 
eigenvalue Xj. 

In view of Lemma 6.5, E{^t\yt+i) is contained in the central subspace 5y|f spanned by 0;^,..., 0^, 
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since we have the normalization cov(fi) = and E{it) = 0. Recall the fact that the eigenvectors 
of corresponding to its L largest eigenvalues would form another orthogonal basis for 5'y|f. 
Hence, the eigenvectors i/’i, • • • ,'t/^L of ^f\y constitute the SDR directions for model (2.1). 

Note that |Aj_i — Aj_i| < ||Sj|y — holds by Weyl’s Theorem in Lemma 6.2 (a). Similarly, 

we have |Aj — Aj+il > |Aj — Aj+il — Op{uip^T)- Then, we have 

|Aj_i — AjI > |Aj_i — Xj\ — |Aj_i — Aj_i| 

— '^il Opijjjp^T^ 1 

where the fact that |Aj_i — Aj_i| < \\'^f\y — ^/|i/ll is used. 

Since have distinct eigenvalues, both |Aj_i — Aj| and \Xj — Aj_|_i| are bounded away from 
zero with probability tending to one for j = 1,..., L. Now, a direct application of sin(0) Theorem 
(]Davis and Kah^, 1970) in Lemma 6.2 (b) shows that 


11^7 -^,'11 < 


V2- W^fly - T,fly\\ 

min(|Aj_i - Xj\, |Aj - Aj+i|) 


— Op(l) • Opiujp^T^ — Opiujp^T^ 


where j = 1,..., L, 


Therefore, we complete the proof of Theorem 3.1 


□ 


6.3 Proof of Theorem 3.2 


Proof. First we write cf) = Y^'t=i Vt+i^t in terms of the true factors ft, i.e., 

-r T-i 


cf) = 


T- 

T- 


^ t=l 
T-1 


t=l 


where we used the fact that ft = A^xt as shown in Proposition 
Using the triangular inequality, we have 


2.1 


110-011 < 110-(A,B)0|| + ||(At,B-1^)011 

Afe 

r-1 


T— 1 T— 1 

ftyt+i + - (AbB)0|| + ||(A,B - 1^)011 


T- 1 


t=i 

T-l 


t=l 


T-l 


< 


J2{yt+ift - 0)11 + II x; Ut2/t+i|| + IKAfeB - lt,)0||. 


T-l 


i=l t=l 

In the sequel, we will bound three terms on the right hand side of this inequality respectively. 
By Lemmas 6.3 and 6.4 we have 

||AbB|| < (IIAfe - Abll + IIAfell) • ||B|| = Op{l) 
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Since ||0|| = Op(l), the third term on the right hand side of the inequality is Op{u}p^T), i-e., 

\\{AtB-lK)H=Op{ujp,T). 

For the second term, note that u* is independent of yt+i, hence E{utyt+i) = 0. By law of large 
numbers and ||Ab|| = Op(p~^^^), we have 

1 ^ 

II utyt+i\\ = Op{p • Op{T = Op(wp,T). 

t=i 

It remains to bound the first term ||ip^ ~ ^)\\- we express ft along 

the basis 0i, • • • , and their orthogonal hyperplane, namely, 

L 

ft = 

i=i 

By the orthogonal decomposition of normal distribution, (ft,0j) and are independent. In 
addition, yt+i depends on ft only through ■ , 4>'L^t, and then, it is conditionally independent 

of f{^. It follows from contraction property that yt+i and f^"^ are independent, unconditionally. 
Thus, 

i?(yt+ift^) = E{yt+i)E{f,^) = 0. 

Recall that ^ = Ylf=i ^{{4>j^t)yt+i)4>j- Now, we use the triangle inequality to obtain 


< 


< 


1 


T-l 


'T- 
, 1 




T- 

L 


t=l 

T-l L L 

Y X] + Vt+1^^ - 

t=l j=l j=l 

T-l , T-l 


7 = 1 i=l i=l 


i=i 

L 


t=l 

T-l 


L T-l T-l 

j=l '' t=l ^ t=l 


where we used the fact that ||0j|| = 1 for any j. Note that each term here is Op{T by Law of 
Large Numbers. Then, we have, 

T—1 

ii^^(X; yt+ift -m< iiAbBii • Op{T-i) = Op{uip,T). 

t=i 


This concludes the desired bound that \\cf) — cf)\\ = Op{i0p^T)- 
Therefore, we complete the proof of Theorem |3.2[ 


□ 
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